Abstract

Many recent efforts have been devoted to designing sophisticated
deep learning structures, obtaining revolutionary results on
benchmark datasets. The success of these deep learning methods
mostly relies on an enormous volume of labeled training samples
to learn a huge number of parameters in a network; therefore,
understanding the generalization ability of a learned deep network
cannot be overlooked, especially when restricted to a small training
set, which is the case for many applications. In this paper, we
propose a novel deep learning objective formulation that unifies
both the classification and metric learning criteria. We then
introduce a geometry-aware deep transform to enable a non-linear
discriminative and robust feature transform, which shows competitive
performance on small training sets for both synthetic and real-world
data. We further support the proposed framework with a formal
$(K, \epsilon)$-robustness analysis.

1. Introduction 

Many recent efforts have been devoted to learning a mapping from
low-level image features, e.g., image patches [16], [17] and LBP
descriptors [14], to high-level discriminative representations. The
learned feature mapping often increases the inter-class separation
while reducing the intra-class variation. This idea dates back at
least to linear discriminant analysis (LDA) for linear cases;
however, if we allow the feature mapping to be non-linear, e.g., a
deep convolutional neural network, the discriminability of the
learned representation is often significantly enhanced compared to
its linear counterpart.

Deep learning techniques achieve unprecedentedly high precision in
object and scene classification, where an enormous volume of labeled
training samples is often required to learn a rich set of
parameters. Despite such revolutionary advances, many real-world
classification problems remain challenging, due to the large number
of non-linearly separable classes and the scarcity of training
samples. One such example is face verification, where recently
reported successes mostly rely on huge proprietary training sets,
e.g., 4.4 million labeled faces from 4,030 people in [17]; however,
publicly available training datasets often consist of only a small
set of subjects with several samples per subject. It is a
notoriously difficult task to learn, from limited training samples,
a deep structure that generalizes well to testing data.

While much current attention is paid to the smart manipulation of
different deep architectures for more discriminative
representations, in this paper we focus on the generalization
problem, i.e., how to encourage a mapping learned from limited
training samples to generalize well over testing data. This issue is
of significant importance when the training samples are scarce, in
which case the network optimized on the training set is at risk of
overfitting. We provide both analytic and experimental illustrations
of the generalization errors of a learned deep structure, under
several popular objective functions.

We further propose a geometry-aware feature transformation
framework, which balances discriminability and generalization. The
proposed framework encourages inter-class separation while at the
same time penalizing the distortion of intra-class structure. This
also extends an earlier "shallow" setup to a deep architecture,
while providing theoretical insights regarding robustness. In
particular, we show that constraining feature mapping functions to
be near-isometric in local sub-regions yields robust algorithms. We
first motivate our framework with a synthetic example, and then
support it through theoretical analysis. We further validate our
framework using face verification experiments and report
state-of-the-art results on the challenging LFW face dataset.

Our main contributions are:

• proposing a novel deep learning objective that unifies
the classification and metric learning criteria;

• providing a theoretical argument showing that awareness
of geometry leads to robustness;

• motivating a general algorithmic framework which
considers data geometry in the formulation;

• designing a learned deep transform, as a particular
example of the proposed geometric framework, that
achieves state-of-the-art results.

2. Geometry-aware deep transform 

Deep networks are often optimized for a classification objective,
where class-labeled samples are input as training data [6], [10],
[16], [17]; or a metric learning objective, where training data are
input as positive and negative pairs (see footnote 1). In this
section, we first propose a novel deep learning objective that
unifies the classification and metric learning criteria. We then
introduce a geometry-aware deep transform, and optimize it through
standard back-propagation. We further support the proposed framework
with a formal $(K, \epsilon)$-robustness analysis [18].

2.1. Pedagogic formulation 

We use the following two-class problem as an illustrative example.
The first class is generated as $x = Uv / \|Uv\|$, where $v$ is
drawn with probability (w.p.) 1/2 from the constrained plane
$-y + z = 1$, $x \in [-1, 1]$, $z \in [-3, 0]$, and w.p. 1/2 from
the plane $y + z = 1$, $x \in [-1, 1]$, $z \in [0, 3]$. $U$ is a
$d \times 3$ matrix ($d > 3$; $d = 100$ in this case) that embeds
$v$ into a $d$-dimensional space. Similarly, the second class is
generated as $x = Uu / \|Uu\|$, where $u$ is drawn w.p. 1/2 from
$-y + z = -1$, $x \in [-1, 1]$, $z \in [-3, 0]$, and w.p. 1/2 from
$y + z = -1$, $x \in [-1, 1]$, $z \in [0, 3]$. For each class, 40
training and 1000 testing samples are generated. Fig. 1 visualizes
the training and testing data by randomly projecting them onto a
3-dimensional coordinate system, with different colors representing
different classes. Observe that the two classes are not linearly
separable, which necessitates a non-linear feature transform.
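The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the Gaussian choice for the embedding matrix `U`, the uniform sampling within each plane, the helper name `sample_class` and the random seed are all assumptions.

```python
import numpy as np

def sample_class(n, offset, U, rng):
    """Draw n unit-norm samples x = U v / ||U v|| for one class.

    offset = +1 gives the first class, offset = -1 the second.
    """
    x = rng.uniform(-1.0, 1.0, size=n)
    first_plane = rng.random(n) < 0.5            # pick each plane w.p. 1/2
    # Plane A: -y + z = offset with z in [-3, 0]  ->  y = z - offset
    # Plane B:  y + z = offset with z in [0, 3]   ->  y = offset - z
    z = np.where(first_plane,
                 rng.uniform(-3.0, 0.0, size=n),
                 rng.uniform(0.0, 3.0, size=n))
    y = np.where(first_plane, z - offset, offset - z)
    v = np.stack([x, y, z], axis=1)              # (n, 3) points on the planes
    xs = v @ U.T                                 # embed into d dimensions
    return xs / np.linalg.norm(xs, axis=1, keepdims=True)

rng = np.random.default_rng(0)                   # assumed seed
d = 100
U = rng.standard_normal((d, 3))                  # assumption: Gaussian embedding
train = {c: sample_class(40, c, U, rng) for c in (+1, -1)}
test = {c: sample_class(1000, c, U, rng) for c in (+1, -1)}
```

Each class is a union of two plane segments, which is why no single hyperplane separates the two classes after embedding.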



(a) Training samples: 40 per class, (b) Testing samples: 1000 per class. 


Figure 1: Training and testing samples. 


1 A positive pair contains two samples from the same class, and a
negative pair contains two samples from different classes.

Figure 2: Transformed features using a metric learning formulation.

(a) Transformed training samples. (b) Transformed testing samples.

Figure 3: Transformed features using a classification formulation.

(a) Transformed training samples. (b) Transformed testing samples.

Figure 4: Transformed features using GDT with $\lambda = 0.4$.


We want to learn a mapping $f(x)$ that transforms the low-level
feature $x$ to a more discriminative one. In this paper, we are
particularly interested in non-linear transforms $f(\cdot)$
implemented as a deep neural network. However, the method and theory
we develop are general in the sense that any other family of
$f(\cdot)$ can be adopted as well. In this example, $f(\cdot)$ is
implemented as a 2-layer fully connected neural network with tanh as
the squash function, taking the form

\[ f(x) = \tanh(A_2 \tanh(A_1 x)), \tag{1} \]

where $A_1, A_2 \in \mathbb{R}^{d \times d}$ are the linear
coefficients in those two layers.
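As a minimal sketch of Eq. (1), the forward pass can be written in a few lines of NumPy. The scaled Gaussian initialization and the function name `two_layer_tanh` are assumptions made for illustration; the text does not specify how the coefficients are initialized.

```python
import numpy as np

def two_layer_tanh(X, A1, A2):
    """Apply f(x) = tanh(A2 tanh(A1 x)) to every row of X."""
    return np.tanh(np.tanh(X @ A1.T) @ A2.T)

rng = np.random.default_rng(0)
d = 100
A1 = rng.standard_normal((d, d)) / np.sqrt(d)    # assumption: scaled Gaussian init
A2 = rng.standard_normal((d, d)) / np.sqrt(d)
X = rng.standard_normal((5, d))                  # five example inputs, one per row
Y = two_layer_tanh(X, A1, A2)                    # transformed features, shape (5, d)
```

Storing samples as rows lets a whole batch pass through both layers with two matrix products.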

Metric learning formulation. The general goal of metric learning is
to ensure that, after the transform, the distance between
intra-class points is small, while the inter-class distance is
large. The Euclidean distance is a common choice of metric; however,
empirical results have shown that the cosine distance outperforms
the Euclidean distance on certain tasks such as face verification.
Moreover, the cosine distance is bounded, which makes the loss
function easier to design. We therefore adopt the cosine distance in
this paper, and propose the metric learning formulation

\[ \min_{A_1, A_2} \sum_{i,j} \left( \frac{f(x_i)^T f(x_j)}{\|f(x_i)\| \cdot \|f(x_j)\|} - t_{i,j} \right)^2, \tag{2} \]

where the indicator

\[ t_{i,j} = \begin{cases} 1 & \text{if } x_i, x_j \in \text{same class}, \\ -1 & \text{otherwise}. \end{cases} \]

Notice that $\frac{f(x_i)^T f(x_j)}{\|f(x_i)\| \cdot \|f(x_j)\|} \in [-1, 1]$
is the cosine of the angle between the transformed features
$f(x_i)$ and $f(x_j)$. The objective of (2) is to encourage the
intra-class angles to be close to 0, and the inter-class ones to be
as close as possible to $\pi$. We use back-propagation to optimize
the parameters $A_1$ and $A_2$, as explained later.
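To make the pairwise objective concrete, the following sketch evaluates the squared difference between pairwise cosines and the +1 / -1 indicator on a toy set of transformed features. Summing over unordered pairs `i < j` is an assumption about how pairs are enumerated, and the function names are hypothetical.

```python
import numpy as np

def cosine_matrix(Y):
    """Pairwise cosine similarities between the rows of Y."""
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Yn @ Yn.T

def metric_learning_objective(Y, labels):
    """Sum over pairs i < j of (cos(y_i, y_j) - t_ij)^2 with t_ij = +1 / -1."""
    C = cosine_matrix(Y)
    T = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    i, j = np.triu_indices(len(labels), k=1)     # each unordered pair once
    return np.sum((C[i, j] - T[i, j]) ** 2)

# Toy transformed features: two samples per class.
Y = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 0, 1, 1])
loss = metric_learning_objective(Y, labels)
```

In this toy case only the pairs involving the last sample contribute, since its direction is orthogonal to all the others.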

Fig. [2a] visualizes the transformed training samples by 
the learned /(•). The learned transform significantly pulls 
apart the two classes and reduces the variations within each 
individual class. We then apply the learned /(•) to the test¬ 
ing samples (fig. [2b] ). However, we observe that the two 
classes are not well separated, raising our concerns about 
the robustness of the pure metric learning formulation in 

0- 

Classification formulation. Now let us consider a different
objective function, where we encourage the intra-class angles to be
preserved after the transformation. This new objective has the same
form as (2), but now the indicator becomes

\[ t_{i,j} = \begin{cases} \frac{x_i^T x_j}{\|x_i\| \cdot \|x_j\|} & \text{if } x_i, x_j \in \text{same class}, \\ -1 & \text{otherwise}. \end{cases} \]

We denote this objective function as a classification formulation,
as it shares similar attributes with the classification objective
commonly optimized for a deep network. Explicit constraints are
imposed to separate different classes, e.g., $t_{i,j} = -1$ for
negative pairs here, but only weak constraints are used to assign
similar representations to the same class. This classification
formulation is less ambitious than the metric learning formulation,
as it does not require the variance within the same class to be
reduced. $f(\cdot)$ is implemented as the 2-layer neural network
described before, and optimized through back-propagation.

The transformed training and testing samples are visualized in
Fig. 3. Comparing Fig. 2 and Fig. 3, we observe that although our
metric learning formulation works well on the training data, it does
not discriminate the two classes well on testing data, i.e., it has
a large generalization error. In contrast, following the
classification formulation, the intra-class variance is not reduced,
yet the deterioration from training to testing is not as
significant. In other words, while the metric learning formulation
is too optimistic about the discrimination we can achieve, the
classification formulation is more robust but conservative.


2.2. Proposed formulation and algorithm

We now introduce a geometry-aware deep transform. We use
$f_\alpha(\cdot)$ to denote the feature transform, to emphasize that
$\alpha$ are parameters to be learned, e.g., filters in a neural
network ($A_1$, $A_2$ in the previous section). $f_\alpha$ can be a
linear function or a non-linear function implemented by a neural
network.

We formulate the transformation learning problem as:

\[ \min_{\alpha} \frac{1}{2} \sum_{i,j} \left( \frac{f_\alpha(x_i)^T f_\alpha(x_j)}{\|f_\alpha(x_i)\| \cdot \|f_\alpha(x_j)\|} - t_{i,j} \right)^2, \tag{3} \]

where the indicator

\[ t_{i,j} = \begin{cases} \lambda + (1 - \lambda) \frac{x_i^T x_j}{\|x_i\| \cdot \|x_j\|} & \text{if } x_i, x_j \in \text{same class}, \\ -1 & \text{otherwise}, \end{cases} \]

and $\lambda \in [0, 1]$. We denote formulation (3) as
Geometry-aware Deep Transform (GDT). The GDT objective is a weighted
combination of the two pedagogic formulations discussed above. We
can understand it as regularizing the metric learning formulation
using the classification one.

We use gradient descent (back-propagation if $f_\alpha(\cdot)$ is a
deep neural network) to solve for $\alpha$ in (3). In particular,
let us denote the objective in (3) as $J$ and define

\[ y_i = f_\alpha(x_i), \qquad c_{i,j} = \frac{f_\alpha(x_i)^T f_\alpha(x_j)}{\|f_\alpha(x_i)\| \cdot \|f_\alpha(x_j)\|}. \tag{4} \]


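Then we have

\[ \frac{\partial J}{\partial y_i} = \sum_j (c_{i,j} - t_{i,j}) \left( \frac{y_j}{\|y_i\| \cdot \|y_j\|} - c_{i,j} \frac{y_i}{\|y_i\|^2} \right), \tag{5} \]

where $c_{i,j}$ is the cosine between $y_i$ and $y_j$ defined in
(4). $\partial J / \partial y_j$ can be calculated in the same
manner. We then back-propagate this gradient through the network to
update all the parameters. More specifically, we denote
$\alpha^{(k)}$ as the filter weights and bias in the $k$-th
($1 \le k \le K$) layer, and $x_i^{(k)}$ as the output of the $k$-th
layer excited by the input $x_i$ (therefore $y_i = x_i^{(K)}$ and
$x_i = x_i^{(0)}$). Then,

\[ \frac{\partial J}{\partial \alpha^{(K)}} = \sum_i \frac{\partial J}{\partial y_i} \frac{\partial y_i}{\partial \alpha^{(K)}}, \qquad
\frac{\partial J}{\partial \alpha^{(k)}} = \sum_i \frac{\partial J}{\partial x_i^{(k+1)}} \frac{\partial x_i^{(k+1)}}{\partial x_i^{(k)}} \frac{\partial x_i^{(k)}}{\partial \alpha^{(k)}}, \quad 1 \le k \le K - 1. \tag{6} \]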


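An overview of the GDT algorithm is summarized in Algorithm 1.


Algorithm 1 Gradient descent solver for GDT

Input: $\lambda \in [0, 1]$, training pairs $\{(x_i, x_j, \ell_{i,j})\}$
  with pair labels $\ell_{i,j}$, a defined $K$-layer network
  ($f_\alpha(\cdot)$ family), stepsize $\gamma$
Output: $\alpha$
while stable objective not achieved do
  Compute $y_i = f_\alpha(x_i)$ by a forward pass
  Compute objective $J$
  Compute $\partial J / \partial y_i$ as in Eq. (5)
  for $k = K$ down to 1 do
    Compute $\partial J / \partial \alpha^{(k)}$ as in Eq. (6)
    $\alpha^{(k)} \leftarrow \alpha^{(k)} - \gamma \, \partial J / \partial \alpha^{(k)}$
  end for
end while
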

As an illustration of Algorithm 1, we apply it with $\lambda = 0.4$
to the illustrative example above. The transformed training and
testing samples are shown in Fig. 4. Compared with the two pedagogic
formulations (equivalent to $\lambda = 1$ and $\lambda = 0$ in GDT,
respectively), the $\lambda = 0.4$ case balances discriminability
and robustness. Before presenting a more detailed experimental
analysis in Section 3, we now provide theoretical insights to
support our robustness claim.
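The solver can be sketched end to end for the 2-layer tanh network of the pedagogic example. This is an illustrative reading, not the authors' implementation: the objective is taken as one half of the sum over unordered pairs of squared differences between output cosines and GDT indicators, the "stable objective" test is replaced by a fixed iteration count, and the initialization, step size and function names are assumptions.

```python
import numpy as np

def gdt_targets(X, labels, lam):
    """t_ij = lam + (1 - lam) * cos(x_i, x_j) for same-class pairs, else -1."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    same = labels[:, None] == labels[None, :]
    return np.where(same, lam + (1.0 - lam) * (Xn @ Xn.T), -1.0)

def objective_and_grad(Y, T):
    """J = 1/2 * sum_{i<j} (c_ij - t_ij)^2 and its gradient w.r.t. each y_i."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Yn = Y / norms
    C = Yn @ Yn.T                                # pairwise cosines c_ij
    E = C - T
    np.fill_diagonal(E, 0.0)                     # exclude the i == j terms
    J = 0.25 * np.sum(E ** 2)                    # symmetric E counts each pair twice
    # Row i: sum_j e_ij * (y_j / (|y_i||y_j|) - c_ij * y_i / |y_i|^2)
    G = (E @ Yn - np.sum(E * C, axis=1, keepdims=True) * Yn) / norms
    return J, G

def train_gdt(X, labels, lam=0.4, step=1e-3, iters=200, seed=0):
    """Fixed-step gradient descent on f(x) = tanh(A2 tanh(A1 x))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    A1 = rng.standard_normal((d, d)) / np.sqrt(d)    # assumption: Gaussian init
    A2 = rng.standard_normal((d, d)) / np.sqrt(d)
    T = gdt_targets(X, labels, lam)
    history = []
    for _ in range(iters):                       # stands in for "until stable"
        H = np.tanh(X @ A1.T)                    # forward pass, layer 1
        Y = np.tanh(H @ A2.T)                    # forward pass, layer 2
        J, G = objective_and_grad(Y, T)
        history.append(J)
        D2 = G * (1.0 - Y ** 2)                  # back-propagate through outer tanh
        D1 = (D2 @ A2) * (1.0 - H ** 2)          # back-propagate through inner tanh
        A2 -= step * (D2.T @ H)                  # gradient step on layer 2
        A1 -= step * (D1.T @ X)                  # gradient step on layer 1
    return A1, A2, history

rng = np.random.default_rng(1)
X = rng.standard_normal((12, 6))                 # toy inputs, one sample per row
labels = np.repeat([0, 1], 6)
A1, A2, history = train_gdt(X, labels)
```

Setting `lam=1.0` gives the pure metric learning indicator, and `lam=0.0` the classification one, so the same routine covers both pedagogic formulations.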

2.3. $(K, \epsilon)$-robustness

GDT regularizes discriminative transform learning with intra-class
structure preservation. In this section, we formally show that a
local isometry regularization induces robustness. In the following,
we assume a general objective that works with distance metrics of
pairs of transformed features.

Let the low-level feature space be $\mathcal{X}$, and the class
label set be $\mathcal{Y} = \{1, \ldots, L\}$, where $L$ is the
number of classes. $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ is
the set of low-level features and their corresponding labels. The
training set is

\[ \mathcal{T} = \{(x_1, y_1), \ldots, (x_n, y_n)\} = \{z_1, \ldots, z_n\} \in \mathcal{Z}^n, \]

which consists of $n$ i.i.d. samples drawn from an unknown
distribution $\mathcal{D}$ defined on $\mathcal{Z}$. The feature
mapping is $f_\alpha : \mathcal{X} \to \mathcal{F}$, where
$\mathcal{F}$ is the transformed feature space.

Denote $\rho$ as a metric endowed with $\mathcal{X}$ and
$\mathcal{F}$. Define the pair label $\ell_{i,j} = 1$ if
$y_i = y_j$, and $-1$ otherwise. We may adopt a loss function
$g(\rho(f_\alpha(x_i), f_\alpha(x_j)), \ell_{i,j})$ that encourages
$\rho(f_\alpha(x_i), f_\alpha(x_j))$ to be small (big) if
$\ell_{i,j} = 1$ ($-1$). We require the Lipschitz constant of
$g(\cdot, 1)$ and $g(\cdot, -1)$ to be upper bounded by $A$
($0 < A < \infty$). Examples of such $g$ include the hinge loss

\[ \max(-\ell_{i,j} (1 - \rho(f_\alpha(x_i), f_\alpha(x_j))), 0), \tag{7} \]

as well as its smoothed version

\[ \log\left(1 + e^{-\ell_{i,j} (1 - \rho(f_\alpha(x_i), f_\alpha(x_j)))}\right), \tag{8} \]

both of which are commonly adopted in the metric learning
literature. In the GDT formulation (3), the quadratic loss has a
bounded Lipschitz constant w.r.t. the cosine distance as well. In
the following, we denote

\[ g(\rho(f_\alpha(x_i), f_\alpha(x_j)), \ell_{i,j}) = h_\alpha(z_i, z_j) \]

for short.
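The Lipschitz requirement on the example losses can be checked numerically. The sketch below reads (7) as max(-l(1 - rho), 0) and (8) as log(1 + exp(-l(1 - rho))), where l is the pair label; this reading of the two formulas is an assumption, as are the function names and the evaluation grid.

```python
import numpy as np

def hinge_loss(rho, ell):
    """Hinge loss max(-ell * (1 - rho), 0) for a pair label ell in {+1, -1}."""
    return np.maximum(-ell * (1.0 - rho), 0.0)

def smoothed_hinge_loss(rho, ell):
    """Smoothed version log(1 + exp(-ell * (1 - rho)))."""
    return np.logaddexp(0.0, -ell * (1.0 - rho))

# Largest finite-difference slope w.r.t. the distance rho, per loss and label.
rho = np.linspace(0.0, 2.0, 2001)
max_slope = {
    (loss.__name__, ell): np.max(np.abs(np.diff(loss(rho, ell)) / np.diff(rho)))
    for loss in (hinge_loss, smoothed_hinge_loss)
    for ell in (+1, -1)
}
```

Under this reading both losses change by at most one unit per unit of distance, so a Lipschitz bound of 1 would suffice for them.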

The empirical loss on the training set (associated with parameter
$\alpha$) is

\[ R_{emp}(\alpha) = \frac{1}{n(n-1)} \sum_{\substack{i,j=1 \\ i \ne j}}^{n} h_\alpha(z_i, z_j). \tag{9} \]

And the expected loss is

\[ R(\alpha) = \mathbb{E}_{z'_1, z'_2 \sim \mathcal{D}} [h_\alpha(z'_1, z'_2)]. \tag{10} \]

The algorithm is a program that seeks

\[ \alpha_{\mathcal{T}} = \arg\min_{\alpha} R_{emp}(\alpha), \tag{11} \]

which minimizes the empirical loss on the training set
$\mathcal{T}$. A metric learning type formulation, including our
GDT, falls in the category of algorithm (11). The quantity
$|R_{emp}(\alpha_{\mathcal{T}}) - R(\alpha_{\mathcal{T}})|$ is
called the algorithm's generalization error. A smaller
generalization error implies robustness.

The work [18] proposes a notion called $(K, \epsilon)$-robustness,
and later work extends the definition of robustness to algorithms
like (11) that work on pairs of samples. It also shows that
$(K, \epsilon)$-robust algorithms have generalization error bounded
as

\[ |R_{emp}(\alpha_{\mathcal{T}}) - R(\alpha_{\mathcal{T}})| \le \epsilon + O\left(\sqrt{K/n}\right). \tag{12} \]

We now rephrase the pairwise definition of $(K, \epsilon)$-robustness:

Definition 1. The algorithm (11) is $(K, \epsilon)$-robust if
$\mathcal{Z}$ can be partitioned into $K$ disjoint sets,
$\{C_k\}_{k=1}^{K}$, such that for all
$\mathcal{T} \in \mathcal{Z}^n$, the learned $\alpha_{\mathcal{T}}$
satisfies:

$\forall z_i, z_j \in \mathcal{T}$ where $i \ne j$,

$\forall z'_1, z'_2 \in \mathcal{Z}$,

if $z_i, z'_1 \in C_p$ and $z_j, z'_2 \in C_q$ for any
$p, q \in \{1, \ldots, K\}$, then

\[ |h_{\alpha_{\mathcal{T}}}(z_i, z_j) - h_{\alpha_{\mathcal{T}}}(z'_1, z'_2)| \le \epsilon. \]

According to Definition 1, $(K, \epsilon)$-robustness essentially
requires that, with the learned $\alpha_{\mathcal{T}}$, a testing
pair $(z'_1, z'_2)$ incurs a loss similar to that of any training
pair $(z_i, z_j)$ that is in the same subset (in a pair-wise sense).
According to the generalization error bound (12), the smaller
$\epsilon$ is, the smaller the generalization error tends to be, and
therefore the more robust the algorithm is.

Before presenting our theory, we need to introduce the covering
number, defined as follows:

Definition 2. For a metric space $(S, \rho)$, we say that
$\hat{S} \subset S$ is a $\gamma$-cover of $S$, if
$\forall s \in S$, $\exists \hat{s} \in \hat{S}$ such that
$\rho(s, \hat{s}) \le \gamma$. The $\gamma$-covering number of $S$
is

\[ N_\gamma(S, \rho) = \min\{|\hat{S}| : \hat{S} \text{ is a } \gamma\text{-cover of } S\}. \]

Remark 1. The covering number describes how many balls (in the
$\rho$ metric sense) we need to "cover" a space. A feature space
with a certain property, e.g., Gaussian distributed or sparsely
representable, has a certain covering number. The more complex the
feature space is, the more balls we need to cover it. In a word, the
covering number reflects the geometry of the set $S$. In particular,
we notice that the set $S$ with covering number
$N_{\gamma/2}(S, \rho)$ can be partitioned into
$N_{\gamma/2}(S, \rho)$ disjoint subsets, such that any two points
within the same subset are separated by no more than $\gamma$.
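The partition argument in the remark can be illustrated on a finite point set. The sketch below builds a cover greedily, which yields a valid cover but only an upper bound on the covering number, and then assigns each point to its nearest center. The Euclidean metric, the sample data and the function names are assumptions.

```python
import numpy as np

def greedy_cover(S, gamma):
    """Return indices of centers forming a gamma-cover of the rows of S.

    Greedy selection gives a valid cover, so its size upper-bounds the
    gamma-covering number; it is not guaranteed to be minimal.
    """
    uncovered = np.ones(len(S), dtype=bool)
    centers = []
    while uncovered.any():
        c = int(np.flatnonzero(uncovered)[0])    # first point not yet covered
        centers.append(c)
        dist = np.linalg.norm(S - S[c], axis=1)
        uncovered &= dist > gamma                # points within gamma are now covered
    return centers

def partition_by_cover(S, centers):
    """Assign every point to its nearest center, giving disjoint cells."""
    D = np.linalg.norm(S[:, None, :] - S[centers][None, :, :], axis=2)
    return D.argmin(axis=1)

rng = np.random.default_rng(0)
S = rng.uniform(size=(200, 2))                   # toy point set in the unit square
gamma = 0.3
centers = greedy_cover(S, gamma / 2)             # a (gamma/2)-cover
cells = partition_by_cover(S, centers)           # disjoint cells of diameter <= gamma
```

Every point lies within `gamma / 2` of its nearest center, so two points in the same cell are at most `gamma` apart by the triangle inequality.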

Lemma 1. $\mathcal{Z}$ can be partitioned into
$L N_{\gamma/2}(\mathcal{X}, \rho)$ subsets, denoted as
$Z_1, \ldots, Z_{L N_{\gamma/2}(\mathcal{X}, \rho)}$, such that for
all $z_1 = (x_1, y_1)$, $z_2 = (x_2, y_2)$ belonging to any one of
these subsets, $y_1 = y_2$ and $\rho(x_1, x_2) \le \gamma$.

Proof. As noticed immediately after Definition 2, we can partition
$\mathcal{X}$ into $N_{\gamma/2}(\mathcal{X}, \rho)$ disjoint
subsets, each with diameter no bigger than $\gamma$. Then we can
partition $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ into
$L N_{\gamma/2}(\mathcal{X}, \rho)$ disjoint subsets, such that any
two samples $(x_1, y_1)$, $(x_2, y_2)$ in any one of these subsets
have $y_1 = y_2$ and $\rho(x_1, x_2) \le \gamma$. □

Lemma 1 also implies a partition of $\mathcal{X}$, denoted as
$X_1, \ldots, X_{L N_{\gamma/2}(\mathcal{X}, \rho)}$, such that any
$x_i, x_j$ from the same subset have $\rho(x_i, x_j) \le \gamma$ and
share the same label.

Theorem 1. If $f_\alpha(x)$ is a $\delta$-isometry (i.e., distances
are distorted by at most $\delta$ after the transform) within each
of $X_1, \ldots, X_{L N_{\gamma/2}(\mathcal{X}, \rho)}$ as described
above, then an algorithm in the category of (11) is
$(L N_{\gamma/2}(\mathcal{X}, \rho), 2A(\gamma + \delta))$-robust.


Proof. The proof follows the definition of
$(K, \epsilon)$-robustness. We pick any training samples $z_i, z_j$
and testing samples $z'_1, z'_2$ such that $z_i, z'_1 \in Z_p$ and
$z_j, z'_2 \in Z_q$ for some
$p, q \in \{1, \ldots, L N_{\gamma/2}(\mathcal{X}, \rho)\}$. Then

\[ \rho(x_i, x'_1) \le \gamma \quad \text{and} \quad \rho(x_j, x'_2) \le \gamma. \]

Notice that $x_i, x'_1 \in X_p$ and $x_j, x'_2 \in X_q$. Therefore,
by the $\delta$-isometry definition,

\[ |\rho(f_\alpha(x_i), f_\alpha(x'_1)) - \rho(x_i, x'_1)| \le \delta, \]

and

\[ |\rho(f_\alpha(x_j), f_\alpha(x'_2)) - \rho(x_j, x'_2)| \le \delta. \]

Rearranging the above gives

\[ \rho(f_\alpha(x_i), f_\alpha(x'_1)) \le \rho(x_i, x'_1) + \delta \le \gamma + \delta, \]

and

\[ \rho(f_\alpha(x_j), f_\alpha(x'_2)) \le \rho(x_j, x'_2) + \delta \le \gamma + \delta. \]

We need to bound the difference between
$\rho(f_\alpha(x_i), f_\alpha(x_j))$ and
$\rho(f_\alpha(x'_1), f_\alpha(x'_2))$ so that we can further invoke
the finite Lipschitz assumption to bound the quantity
$|h_\alpha(z_i, z_j) - h_\alpha(z'_1, z'_2)|$. Specifically,

\[ \begin{aligned}
|\rho(f_\alpha(x_i), f_\alpha(x_j)) - \rho(f_\alpha(x'_1), f_\alpha(x'_2))|
&\le |\rho(f_\alpha(x_i), f_\alpha(x_j)) - \rho(f_\alpha(x'_1), f_\alpha(x_j))| \\
&\quad + |\rho(f_\alpha(x'_1), f_\alpha(x_j)) - \rho(f_\alpha(x'_1), f_\alpha(x'_2))| \\
&\le \rho(f_\alpha(x_i), f_\alpha(x'_1)) + \rho(f_\alpha(x_j), f_\alpha(x'_2)) \\
&\le 2(\gamma + \delta),
\end{aligned} \]

where the second line follows from the triangle inequality, while
the third line follows from the definition of a metric. Notice that
$y_i = y'_1$ and $y_j = y'_2$. Therefore
$|h_\alpha(z_i, z_j) - h_\alpha(z'_1, z'_2)|$ is either

\[ |g(\rho(f_\alpha(x_i), f_\alpha(x_j)), 1) - g(\rho(f_\alpha(x'_1), f_\alpha(x'_2)), 1)|, \]

or

\[ |g(\rho(f_\alpha(x_i), f_\alpha(x_j)), -1) - g(\rho(f_\alpha(x'_1), f_\alpha(x'_2)), -1)|. \]

Since the Lipschitz constants of $g(\cdot, 1)$ and $g(\cdot, -1)$
are no bigger than $A$, we have

\[ \begin{aligned}
|h_\alpha(z_i, z_j) - h_\alpha(z'_1, z'_2)|
&\le A \, |\rho(f_\alpha(x_i), f_\alpha(x_j)) - \rho(f_\alpha(x'_1), f_\alpha(x'_2))| \\
&\le 2A(\gamma + \delta),
\end{aligned} \]

which concludes the proof. □


Remark 2. Theorem 1 tells us that the algorithm will be robust if we
constrain the function $f_\alpha(\cdot)$ to be near-isometric in
local regions. The robustness depends on how close $f_\alpha(\cdot)$
is to an isometry in the local regions. The local regions are
jointly defined by the class labels and the covering number, which,
as described in Remark 1, depicts the geometry of the low-level
feature space. Given that the algorithm is
$(K, 2A(\gamma + \delta))$-robust with
$K = L N_{\gamma/2}(\mathcal{X}, \rho)$, by Eq. (12) we can bound
the generalization error of algorithms that belong to the category
of (11) by

\[ |R_{emp}(\alpha_{\mathcal{T}}) - R(\alpha_{\mathcal{T}})| \le 2A(\gamma + \delta) + O\left(\sqrt{K/n}\right). \]

Remark 3. In practice, we may resort to a formulation like GDT to
encourage the mapping $f_\alpha(\cdot)$ to be near-isometric in the
local regions. We can understand GDT as operating with a small
$\delta$, resulting in a small generalization error. This explains
why GDT is more robust (Fig. 2 to [?]) than the metric learning
formulation.

Remark 4. In fact, GDT only partitions the $\mathcal{X}$ space into
$L$ subsets, implicitly assuming a trivial covering number of 1. One
could further partition within each class, corresponding to a
nontrivial covering number. However, this comes at the cost of
learning local neighborhoods within each class, which is beyond the
scope of this paper.


3. Experiments 


We provided a formal analysis in Section 2.3 to support the proposed
geometry-aware deep transform as a robust framework for optimizing a
deep network. In this section, we further present an experimental
evaluation of GDT, demonstrating its power in producing
discriminative and robust features for classification. We compare
GDT with two state-of-the-art deep learning objectives: DeepFace
(DF) [17] and Deep Metric Learning (DML) [8]. As discussed before,
DeepFace shares attributes with our pedagogic classification
formulation, and DML is close to our pedagogic metric learning
formulation.


3.1. Illustrative example revisited 

We provide here more experimental evaluation using the illustrative
example in Section 2. First we look at how $\lambda$ influences the
performance. The number of training samples per class ranges from 40
to 100, and $\lambda$ is varied in the $[0, 1]$ interval.

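Denote the training set as $\mathcal{T}$ and the testing set as
$\mathcal{V}$. In our case, the empirical loss on the training set
is

\[ R_{emp} = \frac{1}{Z_{\mathcal{T}}} \sum_{x_i, x_j \in \mathcal{T}} \left( \frac{f_{\alpha_{\mathcal{T}}}(x_i)^T f_{\alpha_{\mathcal{T}}}(x_j)}{\|f_{\alpha_{\mathcal{T}}}(x_i)\| \cdot \|f_{\alpha_{\mathcal{T}}}(x_j)\|} - \ell_{i,j} \right)^2, \tag{13} \]

where $Z_{\mathcal{T}}$ is the number of pairs constructed from the
training set. Note that the loss is not the objective in the GDT
formulation (3) averaged over $Z_{\mathcal{T}}$; the objective of
GDT incorporates an intra-class structure-preserving regularization,
which should be excluded in evaluating the empirical loss. The
expected loss is empirically evaluated over the testing set,

\[ \hat{R} = \frac{1}{Z_{\mathcal{V}}} \sum_{x_i, x_j \in \mathcal{V}} \left( \frac{f_{\alpha_{\mathcal{T}}}(x_i)^T f_{\alpha_{\mathcal{T}}}(x_j)}{\|f_{\alpha_{\mathcal{T}}}(x_i)\| \cdot \|f_{\alpha_{\mathcal{T}}}(x_j)\|} - \ell_{i,j} \right)^2, \tag{14} \]

where $Z_{\mathcal{V}}$ is the number of pairs constructed from the
testing set. Here we use the notation $\hat{R}$ to indicate that it
is an empirical estimate.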


Fig. 5a shows $R_{emp}$ and $\hat{R}$ for a variety of $\lambda$ and
$|\mathcal{T}|$. Note that the smaller $\lambda$ is, the more the
structure-preserving regularization is emphasized. We observe that
$R_{emp}$ is consistently lower than $\hat{R}$, indicating that
$R_{emp}$ always tends to be optimistic. As $|\mathcal{T}|$
increases, $\hat{R}$ decreases and $R_{emp}$ approaches $\hat{R}$.
Note that when $|\mathcal{T}|$ is small and $\lambda$ is big,
$R_{emp}$ significantly underestimates $\hat{R}$. Fig. 5b shows an
empirical estimate of the generalization error,
$|R_{emp} - \hat{R}|$. Fixing a particular $|\mathcal{T}|$, the
generalization error decreases as $\lambda$ approaches zero,
implying more robustness.

To see how the robustness influences classification, we apply a
nearest neighbor (1-NN) classifier to the transformed testing data.
The obtained classification accuracy is shown in Fig. 5c. When the
number of training samples per class is small, there is a steady
increase in classification accuracy as $\lambda$ decreases, i.e.,
when more structure preservation is enforced; this increase becomes
less pronounced as the training set size increases. The above
observation clearly shows that, when only a small training set is
given, the robustness gained from the structure preservation
dominates the classification performance.
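A 1-NN classifier on transformed features takes only a few lines. The text does not state which distance the nearest-neighbor search uses; the sketch below assumes cosine similarity, in line with the cosine-based objective, and the function name and toy data are hypothetical.

```python
import numpy as np

def one_nn_predict(train_feats, train_labels, test_feats):
    """Give each test feature the label of its most cosine-similar training feature."""
    a = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    b = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    nearest = (b @ a.T).argmax(axis=1)           # index of the nearest training sample
    return train_labels[nearest]

# Toy transformed features: one training sample per class.
train_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
train_labels = np.array([0, 1])
test_feats = np.array([[2.0, 0.1], [0.1, 3.0], [1.0, 0.9]])
pred = one_nn_predict(train_feats, train_labels, test_feats)
accuracy = np.mean(pred == np.array([0, 1, 0]))
```

Because the classifier has no trainable parameters, its accuracy directly reflects how well the learned transform separates the classes on unseen data.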

As discussed before, when $\lambda = 0$, the objective function is
optimized for classification by imposing explicit constraints,
$t_{i,j} = -1$ for negative pairs, to separate different classes;
however, due to the structure preservation, only weak constraints
are used to enforce similar representations for the same class. This
drawback cannot be overlooked for applications where it is critical
to expect similar representations for same-class samples, such as
face verification and image retrieval. In Section 3.3, we use face
verification to demonstrate a scenario where a balance between
robustness and discrimination is preferred.


3.2. MNIST 


The last section shows an extreme case where the best classification
performance is achieved when $\lambda = 0$. However, in general,
$\hat{R}$ takes its minimum at a nontrivial $\lambda \in (0, 1)$, as
illustrated in this section. We apply GDT to the MNIST dataset. The
$f_\alpha(\cdot)$ we adopt is a neural network made up of 3
convolutional layers, with a pooling layer between every two
consecutive convolutional layers. The original $28 \times 28$ images
are mapped to 256-dimensional feature vectors.

We vary $\lambda \in [0, 1]$ and evaluate $R_{emp}$ on a small
training set of size 500 (50 samples per class). $\hat{R}$ is
empirically estimated on a testing set of size 10000. As shown in
Fig. 6, as $\lambda$ varies from 0 to 1, the empirical loss keeps
decreasing (Fig. 6a), implying increasing discrimination on the
training set. However, the generalization error keeps increasing
(Fig. 6b), implying decreasing robustness. Therefore, to achieve the
smallest $\hat{R}$ (corresponding to the best performance on the
testing set), we need to balance discrimination and robustness. In
general, $\hat{R}$ takes its minimum at some $\lambda \in (0, 1)$
(Fig. 6c).

(a) $R_{emp}$ and $\hat{R}$. (b) Generalization error. (c) 1-NN classification accuracy.

Figure 5: Motivating example revisited.

As a comparison, we also ran LeNet on the same training set. The
LeNet network structure is the same as the one adopted by GDT,
except that a fully connected layer and a softmax loss layer are
added on top. Fig. 6d compares the classification accuracy of 1-NN
on GDT features with that of LeNet. GDT's accuracy is consistently
higher than LeNet's and peaks around $\lambda = 0.5$, where
$\hat{R}$ is the smallest.


3.3. LFW 


We further validate the effectiveness of the geometry-aware deep
transform by performing face verification on the challenging LFW
benchmark dataset [9]. Deep learning methods for face verification
mostly use proprietary training data and are therefore not
reproducible. We adopt an experimental setting from prior work and
train a deep network on the WDRef dataset. The WDRef dataset
contains 2995 subjects and about 20 samples per subject, which is
significantly smaller than a typical (proprietary) training set for
deep learning, e.g., 4.4 million labeled faces from 4,030 people in
[17], or 202,599 face images from 10,177 subjects in [13]. The goal
of this paper is not to reproduce the success of deep learning in
face verification, but to compare the proposed GDT with several
popular objectives optimized in a deep network. In our experiment,
each face is described using a publicly available high-dimensional
LBP feature, which is reduced to dimension 5,000 using PCA.

We compare the proposed GDT with two state-of-the-art deep learning
objectives: DeepFace (DF) [17] and Deep Metric Learning (DML) [8].
To enable a fair comparison, we adopt the same network structure and
input features for all compared methods, but keep their respective
objective functions. DF feeds the output of the last layer to a
$K$-way softmax to predict the probability distribution over $K$
classes, and minimizes a softmax loss. DML uses the Euclidean
distance metric, and minimizes the loss defined in [8]. The function
$f_\alpha(\cdot)$ in (3) is implemented as a two-layer fully
connected network with tanh as the squash function, and the same
network structure is used for DF and DML. Weight decay (conventional
Frobenius norm regularization) is adopted in both DF and DML; a
range of weight decay factors is tried and the best testing
performance is reported. The network is trained on WDRef and then
applied to LFW. To reflect the discriminability of the transformed
features, we only use a simple verification method, comparing the
cosine distance between a given face pair to a threshold.


Table 1: Verification accuracy and AUC on LFW

Method          Accuracy (%)    AUC
High-dim LBP    74.73           0.8222±0.01
DF              88.72           0.9550±0.0029
DML             90.20           0.9640±0.0027
GDT             91.72           0.9724±0.0029


The ROCs for all methods are reported in Fig. 7a. Verification
accuracies and areas under the ROC curves (AUC) are listed in
Table 1. High-dim LBP denotes the original features before the
transform. DF optimizes for a classification objective, the softmax
loss, and separates samples from different classes well; however, it
enforces no explicit constraints to assign similar representations
to the same class. DML enforces discriminative pairwise distances
but, as illustrated before, becomes less robust when restricted to a
small training set. As analyzed in Section 2.3, the proposed GDT is
less conservative than DF, for better discriminability, and at the
same time expects smaller generalization errors than DML by
preserving the local geometry (3). We observe that GDT outperforms
both DF and DML by achieving a balance between discrimination and
robustness. Face verification accuracies are shown in Fig. 7b for
$\lambda$ varying from 0.6 to 1; peak accuracy is observed at
$\lambda = 0.9$, illustrating the effectiveness of geometry
preservation. Considering that both DF and DML are state-of-the-art
deep learning methods, the improvements reported here clearly
demonstrate the strength of GDT.

Figure 6: GDT on MNIST with a very small training set.

Figure 7: Verification accuracy on LFW.

We demonstrated here how the discriminability of original features,
e.g., high-dim LBP here, can be improved with a learned feature
transform. As emphasized, the goal is not to reproduce the success
of deep learning in face verification (which cannot be done because
the data used in the corresponding papers are unavailable); thus, we
perform verification by simply comparing the cosine distance between
each pair with a threshold. Note that more advanced verification
techniques such as JointBayes can always be adopted for improved
accuracies; for example, 95.17% accuracy has been reported by
applying the JointBayes method to the high-dim LBP features. As
observed in prior work, we also expect steady improvements in
verification accuracy by increasing the number of subjects used in
training a deep network.
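The threshold-based verification rule mentioned above can be sketched as follows. How the threshold is selected is not described in the text; choosing the value that maximizes accuracy on a labeled set of pairs is an assumption, as are the function names and the toy data.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    return 1.0 - num / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def verify(a, b, threshold):
    """Declare a pair 'same identity' when its cosine distance is at most the threshold."""
    return cosine_distance(a, b) <= threshold

def best_threshold(dist, same):
    """Pick the observed distance that gives the highest accuracy as a threshold."""
    candidates = np.sort(dist)
    acc = [np.mean((dist <= t) == same) for t in candidates]
    return candidates[int(np.argmax(acc))], max(acc)

# Toy pairs of transformed features: the first two pairs match, the last two do not.
a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
b = np.array([[1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [-1.0, 0.1]])
same = np.array([True, True, False, False])
threshold, acc = best_threshold(cosine_distance(a, b), same)
decisions = verify(a, b, threshold)
```

In a real evaluation the threshold would be chosen on held-out pairs rather than on the pairs being scored, to avoid an optimistic accuracy estimate.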

4. Conclusion 

We proposed a geometry-aware deep transform that unifies both the
classification and metric learning objectives commonly optimized in
learning a deep network. We provided both experimental and
theoretical illustrations to show that our method achieves a balance
between discrimination and robustness, especially when restricted to
a small training set. We demonstrated the effectiveness of the
proposed deep learning objective using real-world data for
applications such as face verification.